attention operation
Video diffusion transformers have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent acceleration methods enhance efficiency by exploiting the local sparsity of attention scores, yet they often struggle with accelerating the long-range computation. To address this problem, we propose VORTA, an acceleration framework with two novel components: (1) a sparse attention mechanism that efficiently captures long-range dependencies, and (2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants. VORTA achieves an end-to-end speedup of 1.76x without loss of quality on VBench, and can be integrated with various other acceleration methods.
Scalable In-context Ranking with Generative Models
In-context Ranking (ICR) is an emerging paradigm for Information Retrieval (IR), which leverages the contextual understanding of LLMs by directly incorporating the task description, candidate documents, and the query into the model's input prompt and tasking the LLM to identify the relevant document(s). While it is effective, efficiency is a significant challenge in this paradigm, especially as the candidate list grows, due to the quadratic/super-linear scaling of the attention operation with context length. To this end, this paper first identifies inherent and exploitable structures in the attention of LLMs fine-tuned for ICR: (1) inter-document block sparsity: attention is dense within each document block but sparse across different documents in the context; and (2) query-document block relevance: the attention scores from certain query tokens to a document block in middle layers strongly correlate with that document's actual relevance. Motivated by these observations, we introduce BlockRank (Blockwise In-context Ranking), a novel method that adapts the attention operation in an LLM by (a) architecturally enforcing the observed inter-document block sparsity, reducing attention complexity from quadratic to linear without loss in performance, and (b) optimizing query-document block relevance for true relevant documents during fine-tuning using an auxiliary contrastive training objective, improving retrieval in attention. Experiments on BEIR, MSMarco and NQ with Mistral-7B demonstrate that BlockRank Mistral matches or outperforms existing SOTA listwise rankers and a controlled fine-tuned baseline while being significantly more efficient at inference (4.7x for 100 MSMarco documents in context) and scaling gracefully to long-context shortlists, around 500 documents in context (approximately 100K context length) within a second, presenting a scalable and effective solution for ICR.
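The inter-document block sparsity described above can be illustrated with a small attention mask. This is a minimal sketch, not BlockRank's implementation: the function names `block_sparse_mask` and `count_allowed` are hypothetical, and the rule that query tokens attend to every position while document tokens attend only within their own block is an assumption made for illustration (the abstract states only that attention is dense within a document block and sparse across documents).

```python
from typing import List


def block_sparse_mask(doc_lengths: List[int], query_length: int) -> List[List[bool]]:
    """Build a boolean mask where mask[i][j] is True if token i may attend to token j.

    Assumed layout (illustrative): all document tokens first, then the query tokens.
    """
    # Label every token with its document index; query tokens get the label -1.
    block_ids: List[int] = []
    for doc_index, length in enumerate(doc_lengths):
        block_ids.extend([doc_index] * length)
    block_ids.extend([-1] * query_length)

    total = len(block_ids)
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if block_ids[i] == -1:
                # Assumption: query tokens see the whole context.
                mask[i][j] = True
            else:
                # Document tokens see only tokens of the same document block.
                mask[i][j] = block_ids[i] == block_ids[j]
    return mask


def count_allowed(mask: List[List[bool]]) -> int:
    """Number of (query position, key position) pairs the mask permits."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    for num_docs in (10, 20, 40):
        mask = block_sparse_mask([8] * num_docs, query_length=4)
        total = len(mask)
        print(num_docs, count_allowed(mask), total * total)
```

Under these assumptions, N documents of length L and a query of length q permit N·L² + q·(N·L + q) attention pairs, which grows linearly in N, whereas full attention needs (N·L + q)² pairs.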
Large-Scale In-Game Outcome Forecasting for Match, Team and Players in Football using an Axial Transformer Neural Network
Horton, Michael, Lucey, Patrick
Football (soccer) is a sport that is characterised by complex game play, where players perform a variety of actions, such as passes, shots, tackles, fouls, in order to score goals, and ultimately win matches. Accurately forecasting the total number of each action that each player will complete during a match is desirable for a variety of applications, including tactical decision-making, sports betting, and for television broadcast commentary and analysis. Such predictions must consider the game state, the ability and skill of the players in both teams, the interactions between the players, and the temporal dynamics of the game as it develops. In this paper, we present a transformer-based neural network that jointly and recurrently predicts the expected totals for thirteen individual actions at multiple time-steps during the match, and where predictions are made for each individual player, each team and at the game-level. The neural network is based on an \emph{axial transformer} that efficiently captures the temporal dynamics as the game progresses, and the interactions between the players at each time-step. We present a novel axial transformer design that we show is equivalent to a regular sequential transformer, and the design performs well experimentally. We show empirically that the model can make consistent and reliable predictions, and efficiently makes $\sim$75,000 live predictions at low latency for each game.
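The idea behind axial attention can be sketched in a few lines: instead of attending over all (time-step, player) pairs at once, attention is applied along the time axis for each player and then along the player axis for each time-step. The code below is a simplified illustration under stated assumptions, not the authors' architecture: it uses unparameterised attention (queries, keys and values are all the raw input vectors), omits the causal masking a live forecasting model would presumably need, and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Grid = List[List[Vector]]  # grid[t][p] is the feature vector of player p at time-step t


def self_attention(seq: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with q = k = v = input (no learned weights)."""
    dim = len(seq[0])
    out: List[Vector] = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in seq]
        peak = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, seq)) / norm for d in range(dim)])
    return out


def axial_attention(grid: Grid) -> Grid:
    """Attend along the time axis per player, then along the player axis per time-step."""
    steps, players = len(grid), len(grid[0])
    temporal: Grid = [[[] for _ in range(players)] for _ in range(steps)]
    for p in range(players):
        mixed = self_attention([grid[t][p] for t in range(steps)])
        for t in range(steps):
            temporal[t][p] = mixed[t]
    return [self_attention(temporal[t]) for t in range(steps)]


def axial_pairs(steps: int, players: int) -> int:
    """Attention pairs evaluated by the two axial passes."""
    return steps * players * (steps + players)


def full_pairs(steps: int, players: int) -> int:
    """Attention pairs evaluated by full attention over the flattened grid."""
    return (steps * players) ** 2
```

For T time-steps and P players, the two axial passes evaluate T·P·(T + P) attention pairs, compared with (T·P)² for attention over the flattened sequence, which is where the efficiency of the axial design comes from.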
Auto Learning Attention
Benteng Ma
Attention modules have been shown to strengthen the representation ability of a neural network by reweighting spatial or channel features, or by stacking both operations sequentially. However, designing the structures of different attention operations requires a large amount of computation and extensive expertise.
Fair comparison and ablation study
The results on CIFAR10 are listed in Table R1. They reveal that HOGA searched by AutoLA (k=4) still outperforms SE and CBAM by a large margin. We further customized SE and CBAM using the group split operation (denoted by "HOG"), resulting in a specific [...] The HOGA searched by AutoLA outperforms its randomly searched counterparts (denoted by "Rand"). We tested the generalization ability of HOGA searched on ResNet56 (denoted by "AutoLA_56") on WiderResNet, and the results indicate the consistent superiority of the HOGA searched by AutoLA over previous attention methods. We also compared AutoLA with SE and CBAM on a larger backbone (e.g., [...]). The results in Table R3 suggest that AutoLA still outperforms other attention modules.